
[dev] split fused moe experts to ensure quantization#2464

Open
liwei109 wants to merge 1 commit into vllm-project:main from liwei109:model_free

Conversation

@liwei109

SUMMARY:
This PR adds a split_fused_moe_experts function for model-free quantization. This ensures that models with fused MoE layers (e.g., Qwen3.5 and Qwen3-VL) containing fused gate_up_proj and down_proj weights can be effectively quantized.

TEST PLAN:
We add a Qwen3.5 example for testing purposes.

Signed-off-by: Li Wei <liwei.109@outlook.com>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the model-free post-training quantization (PTQ) capabilities by addressing a specific challenge with Mixture-of-Experts (MoE) models that have fused expert weights. By introducing a mechanism to split these fused weights into their individual components, the PR ensures that such models can be accurately and effectively quantized, expanding the range of models supported by the quantization framework.

Highlights

  • MoE Expert Splitting: Introduced a new split_fused_moe_experts function to correctly handle and split fused Mixture-of-Experts (MoE) layer weights (specifically gate_up_proj and down_proj) from 3D tensors into individual 2D expert tensors. This is crucial for models like Qwen3.5 and Qwen3-VL to enable proper quantization.
  • Model-Free Quantization Integration: Integrated the split_fused_moe_experts function into the model_free_ptq process, ensuring that fused MoE layers are pre-processed before quantization, thereby enabling effective W8A8 quantization for these models.
  • Qwen3.5 Quantization Example: Added a new example script for quantizing the Qwen3.5-35B-A3B model using the model_free_ptq entrypoint with the W8A8 scheme, demonstrating the practical application of the new MoE expert splitting logic.
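
The splitting described in the highlights can be sketched roughly as follows. This is not the PR's actual code: the function body, key layout, and tensor shapes are assumptions inferred from the review discussion further down (fused gate_up_proj as [num_experts, 2*intermediate, hidden], fused down_proj as [num_experts, hidden, intermediate], keys ending in ".weight").

```python
import torch

def split_fused_moe_experts(tensors: dict) -> dict:
    """Split fused 3D MoE expert weights into per-expert 2D weights.

    Assumed layouts: gate_up_proj is [num_experts, 2*intermediate, hidden]
    and down_proj is [num_experts, hidden, intermediate].
    """
    new_tensors = {}
    for name, tensor in tensors.items():
        if "mlp.experts.gate_up_proj" in name and tensor.dim() == 3:
            if tensor.shape[1] % 2 != 0:
                # Odd second dimension: keep the fused tensor rather than drop it
                new_tensors[name] = tensor
                continue
            # Strip the trailing ".weight" before building per-expert keys
            base = name.rsplit(".weight", 1)[0]
            intermediate_size = tensor.shape[1] // 2
            for idx in range(tensor.shape[0]):
                key = base.replace("mlp.experts.gate_up_proj", f"mlp.experts.{idx}")
                new_tensors[key + ".gate_proj.weight"] = tensor[idx, :intermediate_size, :]
                new_tensors[key + ".up_proj.weight"] = tensor[idx, intermediate_size:, :]
        elif "mlp.experts.down_proj" in name and tensor.dim() == 3:
            base = name.rsplit(".weight", 1)[0]
            for idx in range(tensor.shape[0]):
                key = base.replace("mlp.experts.down_proj", f"mlp.experts.{idx}")
                new_tensors[key + ".down_proj.weight"] = tensor[idx]
        else:
            # Non-MoE tensors pass through unchanged
            new_tensors[name] = tensor
    return new_tensors
```

All slices above are views into the fused tensor, so the split adds no copies until the tensors are serialized.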


Changelog
  • examples/model_free_ptq/qwen3.5_int8.py
    • Added a new example script for quantizing the Qwen3.5-35B-A3B model using W8A8 scheme.
  • src/llmcompressor/entrypoints/model_free/process.py
    • Added the split_fused_moe_experts function to handle and split fused MoE layer weights.
    • Integrated the split_fused_moe_experts function into the process_file function to apply the splitting before quantization.

@mergify mergify bot added the "documentation" (Improvements or additions to documentation) label Mar 11, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a mechanism to split fused MoE experts, which is a necessary step for quantizing models like Qwen3.5. The approach is sound, but I've found a few critical bugs in the implementation of split_fused_moe_experts. Specifically, the logic for generating new tensor names is incorrect and will lead to invalid keys. Additionally, there's a case where a tensor could be inadvertently dropped. I've also included some suggestions to improve code clarity and logging practices. Please address the identified bugs to ensure the feature works as expected.

Comment on lines +222 to +224
print(f"Warning: gate_up_proj {name} has odd second dimension: {tensor.shape}")
continue

Contributor

high

There are two issues here:

  1. When a gate_up_proj tensor has an odd second dimension, the code continues without adding the original tensor to the _tensors dictionary. This will cause the tensor to be dropped from the model, which is a significant bug.
  2. The use of print for logging is not ideal for a library. Please use a proper logger, like loguru, which is used elsewhere in the project. This would also require importing logger at the top of the file.

The print statements on lines 240 and 252 should also be converted to logger.info for consistency.

Suggested change
print(f"Warning: gate_up_proj {name} has odd second dimension: {tensor.shape}")
continue
logger.warning(f"Warning: gate_up_proj {name} has odd second dimension: {tensor.shape}, skipping split")
_tensors[name] = tensor
continue

Comment on lines +236 to +238
base_key = name.replace("mlp.experts.gate_up_proj", f"mlp.experts.{expert_idx}")
_tensors[base_key + ".gate_proj.weight"] = gate_proj
_tensors[base_key + ".up_proj.weight"] = up_proj
Contributor

high

The logic for generating new tensor names for the split gate_proj and up_proj experts is incorrect. It constructs the new key by appending to the original tensor name which already includes .weight, resulting in an invalid key like ...weight.gate_proj.weight. You should remove the .weight suffix from the original name before creating the new keys.

Suggested change
base_key = name.replace("mlp.experts.gate_up_proj", f"mlp.experts.{expert_idx}")
_tensors[base_key + ".gate_proj.weight"] = gate_proj
_tensors[base_key + ".up_proj.weight"] = up_proj
base_key = name.rsplit(".weight", 1)[0].replace("mlp.experts.gate_up_proj", f"mlp.experts.{expert_idx}")
_tensors[base_key + ".gate_proj.weight"] = gate_proj
_tensors[base_key + ".up_proj.weight"] = up_proj

Comment on lines +249 to +250
new_key = name.replace("mlp.experts.down_proj", f"mlp.experts.{expert_idx}") + ".down_proj.weight"
_tensors[new_key] = down_proj
Contributor

high

Similar to the gate_up_proj case, the key generation for down_proj is incorrect. It appends .down_proj.weight to a name that already ends with .weight. Please correct this by first removing the original .weight suffix.

Suggested change
new_key = name.replace("mlp.experts.down_proj", f"mlp.experts.{expert_idx}") + ".down_proj.weight"
_tensors[new_key] = down_proj
new_key = name.rsplit(".weight", 1)[0].replace("mlp.experts.down_proj", f"mlp.experts.{expert_idx}") + ".down_proj.weight"
_tensors[new_key] = down_proj
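
The difference between the buggy and the suggested key construction can be checked in isolation; the example tensor name below is hypothetical but follows the naming scheme discussed in this review.

```python
# Hypothetical checkpoint key following the naming scheme from the review
name = "model.layers.0.mlp.experts.down_proj.weight"
expert_idx = 3

# Buggy: appends ".down_proj.weight" to a name that already ends in ".weight"
buggy = name.replace("mlp.experts.down_proj", f"mlp.experts.{expert_idx}") + ".down_proj.weight"

# Suggested fix: strip the trailing ".weight" first, then build the new key
fixed = (
    name.rsplit(".weight", 1)[0]
    .replace("mlp.experts.down_proj", f"mlp.experts.{expert_idx}")
    + ".down_proj.weight"
)

assert buggy.endswith(".weight.down_proj.weight")  # invalid doubled suffix
assert fixed == "model.layers.0.mlp.experts.3.down_proj.weight"
```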

Returns:
New dictionary with split expert weights
"""
_tensors = {}
Contributor

medium

The variable _tensors is unconventional for a local variable. A more descriptive name like processed_tensors or new_tensors would improve readability. The _ prefix is usually reserved for internal/private variables by convention. This change would need to be applied throughout the function.

print(f"Warning: gate_up_proj {name} has odd second dimension: {tensor.shape}")
continue

hidden_size = tensor.shape[1] // 2
Contributor

medium

The variable hidden_size is confusing. Based on the comment on line 220 (gate_up_proj is typically [num_experts, 2*intermediate, hidden]), tensor.shape[1] is 2 * intermediate_size. Therefore, this variable actually holds the intermediate_size. Renaming it would improve clarity. You will also need to update its usages on lines 232 and 233.

Suggested change
hidden_size = tensor.shape[1] // 2
intermediate_size = tensor.shape[1] // 2

expert_tensor = tensor[expert_idx] # [2*hidden, intermediate]

# Split gate and up projections
gate_proj = expert_tensor[:hidden_size, :]
Collaborator

It might be a little more readable to do the full indexing in one step instead of piecemeal; these are all views, so performance will be equivalent.
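
A minimal illustration of this suggestion, with shapes assumed from the comments above (gate_up_proj as [num_experts, 2*intermediate, hidden]):

```python
import torch

# Toy fused tensor: 2 experts, 2*intermediate=8, hidden=4
fused = torch.arange(2 * 8 * 4, dtype=torch.float32).reshape(2, 8, 4)
intermediate_size = fused.shape[1] // 2
expert_idx = 0

# Piecemeal indexing (as in the PR): materialize the expert slice first
expert_tensor = fused[expert_idx]
gate_a = expert_tensor[:intermediate_size, :]
up_a = expert_tensor[intermediate_size:, :]

# Full indexing in one step (the reviewer's suggestion)
gate_b = fused[expert_idx, :intermediate_size, :]
up_b = fused[expert_idx, intermediate_size:, :]

# Both produce the same views of the underlying storage
assert torch.equal(gate_a, gate_b) and torch.equal(up_a, up_b)
```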

Returns:
New dictionary with split expert weights
"""
_tensors = {}
Collaborator

rename as fused_tensors

Collaborator

@HDCharles HDCharles left a comment


See my comments and the bot comments. Did you run the test successfully? It looks like the x.weight name bug should have produced an error, no?

@mergify
Contributor

mergify bot commented Mar 11, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md


Labels

documentation (Improvements or additions to documentation), quality-failed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants